CEMA Internship Task
1 Environment set up
# Avoid printing of numbers in scientific format
options(scipen = 999)
# Required Packages
pacman::p_load(tidyverse, janitor, dlookr, naniar, plotly, timetk, trelliscopejs)
# Loading data
cema <- read_csv("data/cema_internship_task_2023.csv", show_col_types = F)
# Set default plotting theme
theme_set(theme_minimal())2 Some Data cleaning
3 Exploratory analysis
3.1 Dataset Structure
We have 1410 rows and 11 variables in the data as the ouput below shows.
## Rows: 1,410
## Columns: 11
## $ period <chr> "Jan-23", "Jan-23", "Jan-23", "Jan-23", "Ja…
## $ county <chr> "Baringo County", "Bomet County", "Bungoma …
## $ `Total Dewormed` <dbl> 3659, 1580, 6590, 7564, 1407, 3241, 6751, 4…
## $ `Acute Malnutrition` <dbl> 8, NA, 24, NA, NA, 72, 250, 9, 26, 104, 36,…
## $ `stunted 6-23 months` <dbl> 471, 1, 98, 396, 92, 326, 40, 209, 51, 319,…
## $ `stunted 0-<6 months` <dbl> 34, 3, 154, 143, 71, 86, 13, 87, 6, 102, 27…
## $ `stunted 24-59 months` <dbl> 380, NA, 23, 111, 5, 24, 99, 58, 50, 155, 2…
## $ `diarrhoea cases` <dbl> 2620, 1984, 4576, 2239, 2739, 1376, 2314, 2…
## $ `Underweight 0-<6 months` <dbl> 85, 41, 231, 251, 57, 141, 223, 140, 13, 13…
## $ `Underweight 6-23 months` <dbl> 739, 86, 315, 608, 104, 544, 1856, 298, 180…
## $ `Underweight 24-59 Months` <dbl> 731, 16, 120, 125, 21, 160, 1833, 84, 271, …
3.2 Exploring Missingness
The plot below shows the percentage of missing values in each
variable in our dataset. Most values are missing in the variable
acute_malnutrition (about 25%). Other variables with
missing values are stunted 0-<6 months,
stunted 24-59 months, and stunted 6-23 months
with 1.35%, 0.99%, and 0.78% missingness respectively. You can hover
over the plot to see these values.
3.3 Univariate Data EDA
3.3.1 Lets compute descriptive statistics to help determine the distribution of numerical variables.
We can see the variable stunted_0_6_months has a large
positive skewness. For modeling reasons we should consider a log or sqrt
transformation on this variable to follow the normal distribution.
We can take the above analysis further by grouping the data by county and computing the descriptive statistics.
3.4 EDA of bivariate data
3.4.1 Calculation of correlation coefficient
The following is a correlation matrix show the association of the numeric variables in the dataset. All of them are positively correlated.
##
Analysis of Numeric variables over time
Lastly, as part of our exploratory analysis, we can leverage the period column to explore how the numeric indicators are changing over time by county. We can do this by plotting each numeric indicator against time, for each county. In the chunk below we filter for rows where county is Embu and plot the total dewormed children over time. The plot shows a relatively constant trend with conspicuous hikes in every May and November of every year.
ggplotly(cema_clean %>%
filter(county == "Embu County") %>%
ggplot(aes(period, total_dewormed))+
geom_line()+
geom_point()+
labs(
title = "Total Dewormed Children Over Time", x = "Date", y = "Total Dewormed"
)+
theme_bw()
)Since we have several counties (47) in the data, visualizing all of them in one plot will be will create a visual hard to interpret. We can leverage some javascript to scale the faceting to all 47 counties and indicators. This will create an interactive display which one can search, sort, and filter for any indicator by county.
cema_clean %>%
mutate(county = str_remove(county, " County")) %>%
pivot_longer(
3:last_col(), names_to = "indicator", values_to = "value"
) %>%
unite("analytic", county, indicator, sep = " - ") %>%
group_by(analytic) %>%
plot_time_series(
period, value,
.interactive = F, .x_intercept = "Date", .y_lab = "Value",
.title = "General Trend by County and Indicator",
.smooth = F
)+
geom_point()+
# facet by all groups in the pivoted data
facet_trelliscope(~ analytic, scales = "free", width = 1000, nrow = 3, ncol = 3, as_plotly = T, path = "./")4 Research Question
Is there any association between the occurrence of diarrhea cases in children under 5 years and the number of children dewormed in each county?
4.1 Correlation between the occurrence of diarrhea cases and the number of children dewormed in each county
In the code chunk below we first remove the unnecessary
county string in the county column, then we filter out rows
with missing values in the diarrhoea_cases and
total_dewormed columns, as we need complete data to
calculate the correlation.
After that, we group the filtered data by county and calculate the
correlation coefficient (Pearson correlation) between
diarrhoea_cases and total_dewormed using the
cor function.
cema_clean %>%
mutate(county = str_remove(county, " County")) %>%
drop_na(diarrhoea_cases, total_dewormed) %>%
group_by(county) %>%
summarize(correlation = cor(diarrhoea_cases, total_dewormed, use = "complete.obs")) %>%
ggplot(aes(fct_reorder(county, correlation), correlation))+
geom_col(fill = "steelblue")+
coord_flip()+
theme_bw()+
labs(
title = "Association between the occurrence of diarrhea cases in children under 5 years and the number of \n children dewormed by county",
x = "County", y = "Correlation Coefficient"
)+
theme(
axis.text = element_text(face = "bold"),
plot.title = element_text(vjust = .5, size = 14, face = "bold", colour = "steelblue")
)4.1.1 Interpretation
We see a correlation coefficient close to 1 (positive) in counties like Samburu, Busia, Kisii and Turkana. This suggests a strong positive association between Diarrhoea casess and the number of dewormed children in these counties. In other words, as the number of dewormed children increase in these counties, the number of diarrhea cases among children under 5 years will most likely increase.
We also observer a weak negative correlation coefficient in counties like Nyeri, Kakamega, Tharaka Nithi and Kitui. This suggests a weak negative association between Diarrhoea Cases and the number of dewormed children in these counties. This means as the number of dewormed children increase in these counties, the number of diarrhea cases among children under 5 years will most likely decrease.